This pipeline was created using the following resources:

Within the RStudio file (.rmd), each code chunk can be executed by clicking the Run button within each chunk or by placing the cursor inside it and pressing Cntrl+Shift+Enter.

After installing R and RStudio, the following Bioconductor R packages are necessary for processing RNASeq raw data, conducting differential expression analysis and data visualization:

For installing packages, execute the code on the Installation section in the corresponding Bioconductor page.

For instance, for installing DESeq2:

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("DESeq2")

The remaining packages are part of the Comprehensive R Archive Network (CRAN), which is the official R packages repository. These can be installed in RStudio by going to ToolsInstall Packages and in the Install from option select Repository (CRAN), followed by specifying the intended packages:


After installing every required package, these libraries should be loaded into RStudio so that their commands and functions are available to use. To do so, enter:

#Loading libraries
library("DESeq2")
library("edgeR")
library("AnnotationDbi")
library("org.Hs.eg.db")
library("ggplot2")
library("ggalt")
library("ggrepel")
library("textshaping")
library("tidyverse")
library("RColorBrewer")
library("circlize")
library("pheatmap")
library("ComplexHeatmap")
library("EnhancedVolcano")

Note: Libraries should always be re-loaded when re-opening the script.

Afterwards, open the raw count data file in RStudio. This file is a table that results from mapping the sequencing reads to a reference genome and displays the number of reads assigned to each gene, in each sample. This tab-delimited file contains a header row with each sample name and each row name represents a unique Ensembl gene ID (ENSG).

In this work, the file “2D_counts.txt” was used for differential gene expression analysis of 2D Native hASCs vs 2D Glycoengineered hASCs. For assessing the impact of cell-cell tethering in the transcriptome, 2D Glycoengineered hASCs vs 3D Cellgels samples were pooled into “2Dv3D_counts.txt”.


For ease of comprehension, each function is described line-by-line in commented descriptions. For further information, please refer to each package’s vignette containing the comprehensive command list.


First, RNASeq analysis in 2D conditions was conducted:

#Reads count table and assigns it to a variable called "Counts".
Counts <- read.table("2D_counts.txt", header = TRUE, row.names = 1, sep ="\t")
Counts

Note: Before starting the analysis, confirm that the raw counts table contains the actual unprocessed reads and not the output of a previous DESeq2 workflow. Raw count values are stored as integers, whereas normalized counts are decimals numbers. DESeq2 differential expression analysis can only be conducted on raw unprocessed counts.

#Creates data frame called "coldata" that assigns experimental conditions ("2DControl" or "2DGly") to each sample and stores this factor in the "treatment" column. Also renames the original sample names in the .bam file to shorter and convenient IDs.
coldata <- DataFrame(id = c("2DC1", "2DC2", "2DC3", "2DC4", "2DC5", "2DG1", "2DG2", "2DG3", "2DG4", "2DG5"), treatment= factor(c(rep("2DControl", times = 5), rep("2DGly", times = 5)), levels = c("2DControl", "2DGly")))

#Saves the formatted counts table to a .txt file.
write.table(Counts, file = "2D_counts_v2.txt", sep = "\t", quote=FALSE, col.names = TRUE)
coldata
DataFrame with 10 rows and 2 columns
            id treatment
   <character>  <factor>
1         2DC1 2DControl
2         2DC2 2DControl
3         2DC3 2DControl
4         2DC4 2DControl
5         2DC5 2DControl
6         2DG1 2DGly    
7         2DG2 2DGly    
8         2DG3 2DGly    
9         2DG4 2DGly    
10        2DG5 2DGly    


Next, a pre-filtering strategy is applied to the Counts table. Before proceeding with differential expression analysis, it is important to filter out genes that consistently present extremely low expression. This reduces multiple testing burden and helps in increasing the statistical power of the analysis, while keeping genes of interest.

Herein, counts per million (CPM) are used for pre-filtering rather than raw counts, as CPM accounts for differences in library sizes (i.e., sequencing depth = total number of counts) across samples, ensuring a more robust comparison. For instance, if samples yielded uneven library sizes (e.g. 15 M vs 40 M), a filtering criteria solely based on raw counts would not avoid favoring genes that are expressed in the larger library. CPM is calculated through edgeR as the raw counts divided by library size and multiplied by one million.

Note: CPM calculation through edgeR package is more sophisticated than manually applying the formula as it also automatically corrects for library composition using calcNormFactors.

As a rule of thumb, for library sizes of about 20 M reads, a CPM = 0.5 is used as it corresponds to a raw count of 10 - 15.

Note: To view the library size for each sample enter:

library_size <- colSums(Counts)
library_size.df <- as.data.frame(library_size)
rownames(library_size.df) <- coldata$id
library_size.df
#Cross-checking whether a CPM threshold of 0.5 corresponds to approximately 10 - 15 counts in these samples.
myCPM_colnames = c("C1", "C2", "C3", "C4", "C5", "G1", "G2", "G3", "G4", "G5")
myCPM <- cpm(Counts) #edgeR calculates the Counts Per Million (CPM) values for the Raw Counts object.
plot(myCPM[,1], Counts[,1], xlab="CPM", ylab="Raw Counts", main=paste("Sample", myCPM_colnames[1]), 
     ylim=c(0,50), xlim=c(0,3))
#Adds reference lines at x = 0.5 CPM and y = 10 Raw Counts
abline(v=0.5) + abline(h=10)

Considering library sizes in this dataset are on the order of 20 M, a threshold for genes with CPM > 0.5 was established. An additional requirement for expression in 3 or more libraries is used, as each group contains 5 replicates. Applying a stringent pre-filtering threshold ensures a more reliable and robust analysis. This criteria eliminates genes that are unlikely to be relevant due to (i) inconsistent reads across different samples, and (ii) increased fold change and variance associated with extremely low counts. In addition, this setup ensures that a gene will be retained even if it is only expressed in one experimental condition.

#Pre-filtering Counts to CPM > 0.5 in at least 3 different samples
keepCPM <- rowSums(cpm(Counts) > 0.5) >= 3
CPMCounts <- Counts[keepCPM,]
nrow(Counts)    #61552 = Total number of mapped genes
nrow(CPMCounts) #14194 = Number of genes after pre-filtering
#Running DESeq2 analysis
ddsCPM <- DESeqDataSetFromMatrix(countData = CPMCounts, colData = coldata, design = ~treatment)
ddsCPM <- DESeq(ddsCPM)
resddsCPM <- results(ddsCPM)
resddsCPM #Show DESeq2 results table
log2 fold change (MLE): treatment 2DGly vs 2DControl 
Wald test p-value: treatment 2DGly vs 2DControl 
DataFrame with 14194 rows and 6 columns
                 baseMean log2FoldChange     lfcSE      stat    pvalue      padj
                <numeric>      <numeric> <numeric> <numeric> <numeric> <numeric>
ENSG00000160072  217.7990      0.2211119  0.138383  1.597820 0.1100829 0.2225651
ENSG00000142611   66.1508     -0.4467950  0.178257 -2.506466 0.0121945 0.0393714
ENSG00000225630  525.1312      0.0389046  0.147856  0.263125 0.7924545 0.8773250
ENSG00000131584   42.3183      0.4723784  0.288730  1.636059 0.1018273 0.2098193
ENSG00000169972  130.7999      0.0926638  0.124668  0.743283 0.4573102 0.6180456
...                   ...            ...       ...       ...       ...       ...
ENSG00000210196   52.4841     -0.1962727  0.203209 -0.965866  0.334111  0.499162
ENSG00000276256   89.2431      0.1866894  0.135644  1.376321  0.168722  0.306774
ENSG00000273748   44.4494      0.1237211  0.194458  0.636234  0.524624  0.677302
ENSG00000276197   12.6270      0.0673584  0.571645  0.117833  0.906200  0.946688
ENSG00000271254   50.1216     -0.1487663  0.201549 -0.738116  0.460444  0.620557
#Output DESeq2 Normalized Counts - Necessary for submitting RNASeq datasets to the SRA repository.
normalized_counts <- counts(ddsCPM, normalized=TRUE)
write.table(normalized_counts, file = "2D_normalizedcounts.txt", sep = "\t", quote=FALSE, col.names = TRUE)


Next, to improve interpretability, it is useful to convert the mapped ENSEMBL IDs (e.g., ENSG00000115414) to the more recognizable Gene Symbols (e.g., FN1 for the Fibronectin gene).

#Adds a new column "geneID" to the results table containing the corresponding Gene Symbols to each row
GeneIDs <- mapIds(org.Hs.eg.db,
              keys = row.names(resddsCPM),
              keytype = "ENSEMBL",
              column = "SYMBOL")
ENS_geneIDs <- stack(GeneIDs)
resddsCPM$geneID <- ENS_geneIDs$values
resddsCPM #Show the updated DESeq2 results table with the new geneID column
log2 fold change (MLE): treatment 2DGly vs 2DControl 
Wald test p-value: treatment 2DGly vs 2DControl 
DataFrame with 14194 rows and 7 columns
                 baseMean log2FoldChange     lfcSE      stat    pvalue      padj       geneID
                <numeric>      <numeric> <numeric> <numeric> <numeric> <numeric>  <character>
ENSG00000160072  217.7990      0.2211119  0.138383  1.597820 0.1100829 0.2225651       ATAD3B
ENSG00000142611   66.1508     -0.4467950  0.178257 -2.506466 0.0121945 0.0393714       PRDM16
ENSG00000225630  525.1312      0.0389046  0.147856  0.263125 0.7924545 0.8773250     MTND2P28
ENSG00000131584   42.3183      0.4723784  0.288730  1.636059 0.1018273 0.2098193        ACAP3
ENSG00000169972  130.7999      0.0926638  0.124668  0.743283 0.4573102 0.6180456        PUSL1
...                   ...            ...       ...       ...       ...       ...          ...
ENSG00000210196   52.4841     -0.1962727  0.203209 -0.965866  0.334111  0.499162           NA
ENSG00000276256   89.2431      0.1866894  0.135644  1.376321  0.168722  0.306774    LOC389831
ENSG00000273748   44.4494      0.1237211  0.194458  0.636234  0.524624  0.677302           NA
ENSG00000276197   12.6270      0.0673584  0.571645  0.117833  0.906200  0.946688           NA
ENSG00000271254   50.1216     -0.1487663  0.201549 -0.738116  0.460444  0.620557 LOC102724250

Note: Gene symbols (‘geneID’ column) with NA values are typically pseudogenes, some of those associated with producing antisense RNA transcripts. Not every ENSEMBL Gene ID corresponds to a HGNC Gene Symbol entry.

#Output full results: Normalized counts + DESeq2 results table
results_sum <- cbind(normalized_counts, resddsCPM)
#write.csv(results_sum, file = "2D_full_results.csv", quote=FALSE, col.names = TRUE) #.CSV alternative to upload data
write.table(results_sum, file = "2D_full_results.txt", sep = "\t", quote=FALSE, col.names = TRUE) #.TXT for better data fetching


After DESeq2 processing, differential expression analysis of genes can be conducted. In this work, genes are considered to be Differentially Expressed Genes (DEGs) if they exhibit an adjusted p-value < 0.05 (padj, Benjamini-Hochberg procedure for multiple testing correction) and present log2(fold change) > 1 or < -1.

Specifically, a log2(fold change) > 1 translates into an > 2-fold upregulated gene expression, whereas a log2(fold change) < -1 indicates a < 0.5 fold downregulation. By combining both of these thresholds, this robust criteria provides greater specificity for identifying genes that are not only statistically significant, but are also more likely to be biologically relevant.

#Analysis of DEGs distribution for Donut Plot

resddsCPM <- resddsCPM[!is.na(resddsCPM$padj), ] #Clean-up of genes containing 'NA' in 'padj' column. This occurs in genes filtered out by DESeq2 through automatic independent filtering or containing a sample with an extreme count outlier, as detected by Cook's distance.

sigs <- resddsCPM[resddsCPM$padj < 0.05,]       #Creates a subset of genes that meet the padj < 0.05 threshold.
sigsALL <- sigs[abs(sigs$log2FoldChange) > 1,]  #Total DEGs, use of absolute log2fold change values ('abs').
sigsUP <- sigs[sigs$log2FoldChange > 1,]        #Upregulated DEGs.
sigsDOWN <- sigs[sigs$log2FoldChange < -1,]     #Downregulated DEGs.

#Summary
#dim(resddsCPM) #Total genes = 14193
#dim(sigs)      #Total statistically significant genes = 4692
#dim(sigsALL)   #Total DEGs = 568
#dim(sigsUP)    #Upregulated DEGs = 180
#dim(sigsDOWN)  #Downregulated DEGs = 388

#Output DEGs
sigsALL.df <- as.data.frame(sigsALL)
sigsALL.df$symbol <- mapIds(org.Hs.eg.db, keys = rownames(sigsALL.df), keytype = "ENSEMBL", column = "SYMBOL") #Alternative way to add Gene Symbols to each row, available upon conversion to a data.frame object.
write.csv(sigsALL.df, "2D_DEGs.csv", quote = FALSE, col.names = TRUE)

#Output Volcano table
write.csv(resddsCPM, "2D_Volcano.csv", quote = FALSE, col.names = TRUE)


Next, the quality of the RNASeq dataset can be validated by plotting the per-gene dispersion estimates versus the mean expression. The visualization of this plot (plotDispEsts) allows for rapidly checking whether the fitted mean-dispersion relationship (red trend line) shows an appropriate fit to the data and ensure the validity of differential testing assumptions in downstream analysis.

#Plot fitted mean-dispersion relationship
plotDispEsts(ddsCPM)


In addition, it is important to perform a scatterplot of the biological coefficient of variation (BCV) against the average abundance of each gene. A BCV plot provides further visualization of the biological variability across samples in terms of gene expression, and whether this variance is adequate for fitting.

As found below, the common dispersion (Disp) is 0.01331 and the BCV (square root of Disp) is 0.1154. Typically, the asymptotic value for BCV tends to be within the 0.05 - 0.2 range for cell line experiments or genetically identical mice, whereas larger values (> 0.3) can be found in human subjects datasets. As a result, this BCV analysis further confirms the validity of the dataset for subsequent downstream analysis, as it will be demonstrated below.

#edgeR processing for fitting and calculations
condition = factor(c(rep("2DControl", times = 5), rep("2DGly", times = 5)))
dgeGlm <- DGEList(counts = CPMCounts, group = condition)
str(dgeGlm)
names(dgeGlm)
dgeGlm[["samples"]]
#Calculate common dispersion (Disp) and biological coefficient of variation (BCV)
design <- model.matrix(~condition)
dgeGlmComDisp <- estimateGLMCommonDisp(dgeGlm, design, verbose = TRUE) #Disp = 0.01331, BCV = 0.1154
Disp = 0.01331 , BCV = 0.1154 
#Plotting BCV scatterplot
dgeGlmTrendDisp <- estimateGLMTrendedDisp(dgeGlmComDisp, design)
dgeGlmTagDisp <- estimateGLMTagwiseDisp(dgeGlmTrendDisp, design)
plotBCV(dgeGlmTagDisp)


RNASeq datasets often contain thousands of genes across multiple samples, making it extremely complex to detect and analyze patterns of gene expression data. In this context, Principal Component Analysis (PCA) transforms data into a set of uncorrelated variables (i.e., principal components), which is useful for reducing data dimensionality, providing a blind and unbiased assessment of gene expression patterns across multiple samples and conditions. PC1 describes the most variation within the data, PC2 the second most, and so forth.

Therefore, PCA is highly valuable for (i) identifying gene expression trends (e.g., confirming the biological relevance of the treatment/disease or experimental condition), (ii) identifying clusters of samples and their relationships, and (iii) screening out any potential outliers or batch effects. Visualizing these effects assists in confirming the hypothesis tested in further DEGs downstream analysis.

To create a PCA plot, first a vst() (i.e., variance-stabilizing transformation) function should be applied to remove the dependence of the variance on the mean. This is because analysis of multi-dimensional data (such as PCA) works best for homoskedastic data (i.e., consistent variability), whereas in RNASeq data variance increases with the mean values. Unlike the other DESeq2 function (i.e., rlog()) function for this purpose, vst() is more sensitive to size factor and also normalizes with respect to library size.

#Output 2D PCA Plot
vsdata <- vst(ddsCPM, blind = FALSE) #Apply variance-stabilizing transformation to DESeq object 'ddsCPM'
#By default both these functions are **blind** to the sample design formula given to DESeq2 in DESeqDataSetFromMatrix(), however this is not appropriate if one expects large differences in counts explained by differences in experimental design. In such cases, the blind parameter should be set to 'FALSE' as recommended in DESeq2 vignette. 


#plotPCA(vsdata, intgroup = "treatment")          #Preview PC1 and PC2 % for correct axis labeling.
PCAraw <- plotPCA(vsdata, intgroup = "treatment") #Create a PCA plot using 'vsdata' and groups it by 'treatment'.
group.colors <- c("#619CFF","#00C19F")            #Vector containing color codes to be used in each group.
new_pca <- ggplot(PCAraw[["data"]],               #Create a ggplot using 'data' data.frame within 'PCAraw' object. 
  aes(PC1, PC2, color=group)) +                   #Construct aesthetic mapping using PC1 and PC2.
  scale_color_manual(values = group.colors) +     #Specify a vector of colors.
  geom_point(size = 2.5) +                        #Change marker size.
  xlab("PC1 (79%)") +                             #Text used for X axis.
  ylab("PC2 (9%)") +                              #Text used for Y axis.
  ylim(-10, 10) +                                 #Lower/upper limits for Y axis.
  xlim(-12, 12) +                                 #Lower/upper limits for X axis.
  theme_grey() +                                  #Set a grey background with white gridlines.
theme(axis.title.x = element_text(size = 14),     #Customize size and style of axis titles and text.
  axis.title.y = element_text(size = 14),
  axis.text.x = element_text(size = 12, face = "bold"),
  axis.text.y = element_text(size = 12, face = "bold"))

new_pca
ggsave("2D_PCA.svg", new_pca, dpi = 600, width = 2.6, height = 2, units = "in", scale = 2) #Exports plot in .svg


Apart from PCA, another important analysis to perform is to determine whether intergroup variability - which represents differences between experimental conditions in comparison with control conditions - is greater than the intragroup variability, representing technical or biological variance. This is achieved by calculating distance as represented by correlation across samples, typically using the Pearson’s correlation plot.

Pearson’s coefficient reflects the linear relationship between 2 variables accounting for differences in mean and standard deviation. For instance, the more similar the gene expression profiles for all transcripts are between two samples, the higher the correlation coefficient will be. These coefficients are calculated between all samples and are then visualized as a heatmap.

Similar to PCA, the Pearson’s correlation plot provides an overview of the overall strength of the biological effect of the experiment, assessing whether replicates and different conditions group together, as well as identify outliers that were not excluded during upstream steps (e.g., sample labeling or genomic alignment).

#Output 2D Pearson's Correlation Plot
vsdata <- vst(ddsCPM, blind = FALSE)                    #Apply variance-stabilizing transformation.
sampleDists <- cor(method = "pearson", assay(vsdata))   #Calculate Pearson correlation coefficients.
sampleDistMatrix <- as.matrix(sampleDists)              #Convert to matrix format.
rownames(sampleDistMatrix) <- paste(vsdata$id)          #Label Sample names.
colnames(sampleDistMatrix) <- paste(vsdata$id)          #Label Sample names.
colors <- colorRampPalette(brewer.pal(9, "Blues"))(255) #Color palette range for heatmap visualization.
SampleDistance = pheatmap(sampleDistMatrix, col=colors) #Generate heatmap with 'sampleDistMatrix' and 'colors' inputs.
SampleDistance #ggsave("SD_2D_Pearson.svg", SampleDistance, dpi = 600, width = 2.55, height = 2, units = "in", scale = 2) 

#Alternative: Euclidean Distances Correlation Plot, as described in DESeq2 vignette
sampleDists_Euclidean <- dist(t(assay(vsdata)))                 #Calculate Pearson correlation coefficients.
sampleDistMatrix_Euclidean <- as.matrix(sampleDists_Euclidean)  #Convert to matrix format.
rownames(sampleDistMatrix_Euclidean) <- paste(vsdata$id)        #Label Sample names.
colnames(sampleDistMatrix_Euclidean) <- paste(vsdata$id)        #Label Sample names.
colors <- colorRampPalette(rev(brewer.pal(9, "Blues")))(255)    #Color palette range for heatmap visualization.
SampleDistancevignette = pheatmap(sampleDistMatrix_Euclidean,   #Generate heatmap with 'sampleDistMatrix' and 'colors' inputs.
         clustering_distance_rows=sampleDists_Euclidean,        #Clustering by Euclidean distance.
         clustering_distance_cols=sampleDists_Euclidean,        #Clustering by Euclidean distance.
         col=colors)
SampleDistancevignette #ggsave("2D_SD.svg", SampleDistancevignette, dpi = 600, width = 2.55, height = 2, units = "in", scale = 2)



Next, a Volcano plot can be assembled to provide a general overview of the gene expression profiles of 2D Native hASCs vs 2D Glycoengineered hASCs. Volcano plots provide a clear and intuitive visualization of statistical significance and the extent of fold changes for every gene detected. Here, each gene is plotted with their statistical significance (adjusted p-values on the y-axis) agaisnt the fold change (log2 fold change on the x-axis).

Then, a threshold is set to highlight genes that are differentially expressed (DEGs), as represented in the plot in the form of dashed lines. As aforementioned, this work considers as DEGs genes that present an adjusted p-value < 0.05 and log2(fold change) > 1 or < -1. This combination of significance and fold change information allows researchers to efficiently identify and prioritize genes of interest for further investigation. In addition, Volcano plots can also assist in screening out genes with high statistical significance but low fold changes, which may be significant due to large sample sizes but may not have substantial biological relevance.

#Output 2D Volcano plot with specific Genes of Interest labelled

resddsCPM.df <- as.data.frame(resddsCPM)                        #Convert DESeq2 results object to a data frame.

#Create custom key-value pairs for 'Up', 'Down', 'Non-DEG' expression. Achieved using nested 'ifelse' statements.
keyvals <-  ifelse(resddsCPM$padj > 0.05, 'grey30',             #If padj > 0.05 assign grey color.
            ifelse(resddsCPM$log2FoldChange < -1, 'royalblue',  #If log2foldchange < -1, assign blue color.
            ifelse(resddsCPM$log2FoldChange > 1, 'firebrick3',  #If log2foldchange > 1, assign red color.
                                                  'grey30')))   #Other values assigned grey color.
#Rename elements in 'keyvals' vector:
names(keyvals)[keyvals == 'firebrick3'] <- 'Up'                 #Assign 'Up' to elements in red.
names(keyvals)[keyvals == 'grey30'] <- 'Non-DEG'                #Assign 'Non-DEG' to elements in red.
names(keyvals)[keyvals == 'royalblue'] <- 'Down'                #Assign 'Up' to elements in red.

#Define vector containing Gene Symbols to highlight in plot
selected = c("PTGS2", "KRT19", "CXCL1", "SLC7A11", "NDNF", "PGD", "VCAM1", "GPC1", "ANGPT1", "GDF15")

#Volcano plot settings
enhvolc <- EnhancedVolcano(resddsCPM.df,                        #Create Volcano plot using 'resddsCPM.df'
        x = "log2FoldChange",                                   #Column name containing log2(fold change) values.
        y = "padj",                                             #Column name containing adjusted p-value values.
        #lab = NA,                                              #Uncomment this function for unlabeled genes.
        pCutoff = 0.05,                                         #Cut-off threshold for statistical significance.
        FCcutoff = 1,                                           #Cut-off threshold for absolute log2(fold change).
        title = NULL,                                           #Plot title.
        subtitle = NULL,                                        #Plot subtitle.
        caption = NULL,                                         #Plot caption.
        xlab = bquote(~Log[2]~"(Fold Change)"),                 #Label for x-axis. 
        ylab = bquote(~-Log[10] ~ "(" * italic(P[adj]) * ")"),  #Label for y-axis.
        colAlpha = 0.5,                                         #Alpha for color transparency.
        colCustom = keyvals,                                    #Override color scheme (e.g. color by pathway, cell-type or condition).
        pointSize = 1,                                          #Size of plotted points.
        labSize = 4,                                            #Size of labels for each point.
        drawConnectors = TRUE,                                  #Connect labels to each corresponding point by line connectors.
        widthConnectors = 0.65,                                 #Line width of connectors.
        arrowheads = FALSE,                                     #Draw arrowheads instead of simple line connectors.
        boxedLabels = TRUE,                                     #Draw labels within boxes.
        directionConnectors = 'both',                           #Direction to draw connectors (e.g., 'y', 'x' or 'both').
        xlim = c(-6, 6),                                        #Limits of x-axis.
        ylim = c(0, -log10(10e-255)),                           #Limits of y-axis.
#Exclude these if not highlighting specific genes
        lab = resddsCPM.df$geneID,                              #Input Gene Symbol column of data.frame object.
        selectLab = selected                                    #Vector containing a subset of genes of interest.
        )

#Customize axis limits and breaks
p1 <-  enhvolc +
    ggplot2::coord_cartesian(xlim = c(-6.4, 6.4),               #Zoom plot to the specific x-axis limit. 
                             ylim = c(0, 250)) +                #Zoom plot to the specific y-axis limit. 
    ggplot2::scale_x_continuous(breaks=seq(-6, 6, 2)) +         #Set spacing of breaks in plot x-axis.
    ggplot2::scale_y_continuous(breaks=seq(0, 250, 50))         #Set spacing of breaks in plot y-axis.
p1
ggsave("2D_Volcano_Wide_Label.svg", p1, width = 4.5, height = 2.8, units = "in", scale = 2)


Alternatively, Volcano plots can be customized via other packages (e.g., ‘ggplot2’) as subplot overlays for presenting a list of target genes with different shading specifications.

For instance, overlaying all genes that match a specific requirement (i.e., all ILs genes).


In addition, it is also possible to assemble a Volcano plot containing an overlay of genes of interest on top.


Alternatively, it is possible to assemble a MA plot, which conveys a similar overview to Volcano plots. However, this plot lacks the design flexibility that Volcano plots offer through their dedicated R packages.

#Alternative to Volcano - MA Plot. Less flexibility in customizing the final figure.
MAPlot_dds <- DESeq2::plotMA(resddsCPM, alpha = 0.05, #Set p-value cut-off threshold.
               colNonSig = "gray30",                  #Set color code for Non-DEG.
               colSig = "royalblue3",                 #Set color code for statistically significant.
               ylim = c(-7, 7))                       #Set y-axis limit.
abline(h = c(1, -1), lty = 2)                         #Draw horizontal lines at the specified y-axis values, line type - dashed.


Apart from Volcano plots, Heatmaps provide an important visualization tool in RNASeq for detecting gene expression trends and highlighting genes with relevant biological functions. Herein, each sample is assigned to a column, and each row corresponds to a DEG with color-coded expression level. Expression values are typically normalized and transformed (i.e., row z-scaling) to ensure that genes with varying expression levels can be compared easily. Heatmaps are usually paired with hierarchical clustering algorithms (e.g., Pearson’s) that group together genes and samples with similar expression profiles, generating a hierarchical tree-like dendrogram. Heatmaps are thus useful in RNASeq downstream analysis for revealing co-expression patterns and biological functions affected by experimental conditions.

#Transformation of data for Heatmap
vsdata <- vst(ddsCPM, blind = FALSE)        #Apply variance-stabilizing transformation.
vst_mat <- assay(vsdata)                    #Create matrix with transformed gene expression values.
vst_mat <- vst_mat[rownames(sigsALL.df), ]  #Subset matrix to include only DEGS, i.e., rows that match genes listed in 'sigsALL.df'.
Genename <- mapIds(org.Hs.eg.db,            #Map Ensembl ID from matrix to Gene Symbol and store in variable 'Genename'.
               keys = row.names(vst_mat),
              keytype = "ENSEMBL",
              column = "SYMBOL")

#Replace NA gene symbols with ENSEMBL IDs in vst_mat.
#(i) Identify which entries in 'Genename' have missing values (NA).
no_symbol_idx <- is.na(Genename)

#(ii) For 'Genename' entries with NA, replace values with original Ensembl ID extracted from 'vst_mat' row names.
Genename[no_symbol_idx] <- row.names(vst_mat)[no_symbol_idx]

#(iii) Convert 'Genename' vector into a single column data.frame, making it suitable for assignment to row names of 'vst_mat'
ENS_geneIDs <- stack(Genename)       

#(iv) Replace Ensembl ID entries with Gene Symbols when available, and retains original Ensembl ID if NA.
row.names(vst_mat) <- ENS_geneIDs$values                      


#Double-check that no DEGs were lost during assignment
#dim(vst_mat) #568 DEGs -> TRUE

#Scale genes z-score by row, as otherwise large values from highly expressed genes would dominate the plot
vst_mat_z <- t(scale(t(vst_mat)))

#Assign column names to each sample
colnames(vst_mat_z) <- c("2DC1", "2DC2", "2DC3", "2DC4", "2DC5", "2DG1", "2DG2", "2DG3", "2DG4", "2DG5")

#Heatmap parameters
col_fun <- colorRamp2(c(-2, 0, 2),                        #Create segments for the color gradient.
                c("#2166AC", "white", "#B2182B"))   #Create color gradient with the specified color codes.

lgd_list <- list(title = "Z-score",                       #Set legend title.
                legend_height = unit(2.5, "cm"),          #Set height of the legend body.
                grid_width = unit(0.4, "cm"),             #Set width of each legend grid.
                title_position = "topcenter",             #Set position of the title.
                gap = unit(4, "mm"))                      #Set gap between heatmap and legend.
#Output full DEGs Heatmap
ht <- Heatmap(vst_mat_z,                                        #Matrix input for heatmap.
        col = col_fun,                                          #Set vector of colors for interpolation.
        show_column_names = FALSE,                              #Whether to show column labeling. 
        show_row_names = FALSE,                                 #Whether to show row labeling.
        #row_labels = sigs.df[rownames(vst_mat_z),]$Symbol,     #Uncomment to specify row labeling. 
        row_names_gp = gpar(fontsize = 5),                      #Set font size for row labeling.
        clustering_distance_columns = "pearson",                #Set clustering method for column dendrogram.
        clustering_distance_rows = "pearson",                   #Set clustering method for row dendrogram.
        #column_labels = colnames(vst_mat_z),                   #Uncomment for specifying column labeling. 
        column_names_gp = gpar(fontsize = 5),                   #Set font size for column labeling.
        row_dend_reorder = FALSE,                               #Whether to apply row dendrogram reordering. 
        column_dend_reorder = TRUE,                             #Whether to apply column dendrogram reordering. 
        show_row_dend = TRUE,                                   #Whether to show row dendrogram clustering.
        row_dend_width = unit(10, "mm"),                        #Set width of row dendrograms.
        show_column_dend = FALSE,                               #Whether to show column dendrogram clustering.
        width = unit(2, "in"),                                  #Set heatmap width.
        height = unit(10, "in"),                                #Set heatmap height.
        name = "Z-score")                                       #Set title.
#ht
#tiff("2D_Full_Heatmap.tiff", width = 4, height = 10, units = "in", res=600)
#draw(ht)
#dev.off()



Next, it becomes highly valuable to zoom in on specific gene subsets that are involved in relevant biological processes. Herein, DEGs related to extracellular matrix organization, glycoproteins and secreted factors are mapped to unveil the impact of the cell surface engineering process on the transcriptome.

#Create vectors of gene lists
Genelist_ECM = c("ADAMTS15", "WNT16", "WNT2", "WNT9A", "ANGPTL4", "BMP4", "MMRN2", "COL21A1", "COL22A1", "OMD", "TNXB", "THSD4")
Genelist_Glycoproteins = c("ADAMTS15", "APCDD1L", "ABCA6", "ABCA8", "ABCA9", "ABCB6", "ABCC2", "ABCC3", "ATP6V0E2", "BRINP1", "CEBPB", "CD200", "EPHA5", "FAM20A", "GPR137B", "GPR68", "GPRC5A", "GPRC5B", "KCNJ5-AS1", "KITLG", "KIT", "LIF", "LYPD1", "MPL", "NOX4", "SHANK2", "SLITRK4", "ST6GALNAC5", "ST6GAL2", "B3GNT8", "WNT16", "WNT2", "WNT9A", "ADCY4", "AFP", "MGAT4B", "AREG", "ANGPT1", "ANGPTL4", "AGTR1", "AQP1", "ACKR3", "B3GALT2", "B4GALNT1", "BMP4", "CDH18", "CDH6", "CDH8", "CHST6", "CDON", "CEMIP", "CHRM2", "CSGALNACT1", "COL21A1", "COL22A1", "C1RL", "CFHR1", "CNNM4", "ENPP4", "ELAPOR1", "ESM1", "EDNRB", "EFNA5", "EPOR", "GRPR", "GPC1", "GDF15", "GDF6", "HHIP", "HGF", "HMMR", "HSD11B1", "IGSF10", "IGFBP3", "ICAM1", "IFNE", "IL1RL1", "IL20RA", "IL34", "JUP", "LRRN3", "LIPG", "ME1", "MXRA5", "MMRN2", "NDNF", "NETO2", "NOTCH4", "OMD", "PLAT", "PDGFRL", "PLXDC1", "GALNT15", "GALNT3", "POPDC3", "KCNK1", "PCSK6", "PCSK9", "PTGDR", "PTGER3", "PTGS2", "PTPRQ", "PCDHB5", "RAET1E", "RNF128", "RNF150", "SPP1", "SEMA3C", "SEMA3D", "SEMA6D", "PRSS35", "SNAI1", "SLC1A2", "SLC12A9", "SLC14A1", "SLC2A12", "SLC2A5", "SLC24A3", "SLC38A1", "SLC38A4", "SLC4A8", "SLC7A11", "SLC7A14", "TNXB", "TENM4", "THBD", "TRPC6", "TMEM130", "VCAM1")
Genelist_Secreted = c("ADAMTS15", "ABCB6", "CCL26", "CXCL1", "CXCL2", "CXCL3", "CXCL6", "CXCL8", "C1QTNF2", "FAM20A", "HTRA3", "KCNJ5-AS1", "KITLG", "LIF", "WNT16", "WNT2", "WNT9A", "AKR1B10", "AFP", "ANGPT1", "ANGPTL4", "BMP4", "CEMIP", "COL21A1", "COL22A1", "C1RL", "CFHR1", "ESM1", "EDN1", "EPOR", "FABP5", "GPC1", "GDF15", "GDF6", "HHIP", "IGSF10", "IGFBP3", "IFNE", "IL1B", "IL1RL1", "IL34", "LIPG", "MXRA5", "MMRN2", "NDNF", "OMD", "PINLYP", "PLAT", "PDGFRL", "PLXDC1", "PCSK6", "PCSK9", "RAET1E", "SPP1", "SCG5", "SEMA3C", "SEMA3D", "PRSS35", "TNXB", "THSD4")

#Replace here [Genelist_ECM] with the name of the vector for specific genes c("XX", "XY", "XZ")
selected_DEGs <- Genelist_Secreted

#Filters the matrix to retain only the rows that correspond to the genes listed in 'selected_DEGs' vector.
vst_matSelected <- vst_mat[rownames(vst_mat) %in% selected_DEGs, ]

#Double check to ensure that all selected DEGs were retrieved
#dim(vst_mat)                    #568 DEGs
#dim(vst_matSelected)            #60 DEGs
#length(Genelist_Secreted)       #60 DEGs

#Apply row z-scaling
vst_matSelected_z <- t(scale(t(vst_matSelected)))

#Assign column names to each sample
colnames(vst_matSelected_z) <- c("2DC1", "2DC2", "2DC3", "2DC4", "2DC5", "2DG1", "2DG2", "2DG3", "2DG4", "2DG5")

#Heatmap parameters
col_fun <- colorRamp2(c(-2, 0, 2),                        #Create segments for the color gradient.
                      c("#2166AC", "white", "#B2182B"))   #Create color gradient with the specified color codes.

lgd_list <- list(title = "Z-score",                       #Set legend title.
                title_gp = gpar(fontsize = 12),           #Set font size for legend title.
                legend_height = unit(2.5, "cm"),          #Set height of the legend body.
                legend_width = unit(5, "cm"),             #Set width of the legend body.
                grid_width = unit(0.4, "cm"),             #Set width of each legend grid.
                direction = "horizontal",                 #Whether scale is 'horizontal' or 'vertical'
                title_position = "topcenter",             #Set position of the title.
                gap = unit(4, "mm"))                      #Set gap between heatmap and legend.


ht_selected <- Heatmap(vst_matSelected_z,                 #Matrix input for heatmap.
        col = col_fun,                                    #Vector of colors for interpolation.
        rect_gp = gpar(col = "grey30", lwd = 0.15),       #Border line color and width of each cell.
        show_column_names = FALSE,                        #Whether to show column labeling.
        show_row_names = TRUE,                            #Whether to show row labeling.
        row_names_gp = gpar(fontsize = 5),                #Set font size for row labeling.
        clustering_distance_columns = "pearson",          #Set clustering method for column dendrogram.
        clustering_distance_rows = "pearson",             #Set clustering method for row dendrogram.
        #column_labels = colnames(vst_matSelected_z),     #Uncomment to specify column labeling.
        column_names_gp = gpar(fontsize = 5),             #Set font size for column labeling.
        row_dend_reorder = FALSE,                         #Whether to apply row dendrogram reordering.
        column_dend_reorder = TRUE,                       #Whether to apply column dendrogram reordering. 
        show_row_dend = FALSE,                            #Whether to show row dendrogram clustering.
        row_dend_width = unit(10, "mm"),                  #Set width of row dendrograms.
        show_column_dend = FALSE,                         #Whether to show column dendrogram clustering.
        width = unit(3, "in"),                            #Set heatmap width.
        height = unit(3.5, "in"),                         #Set heatmap height.
        heatmap_legend_param = lgd_list,                  #Reference object for legend parameters.
        #show_heatmap_legend = FALSE,                     #Uncomment to hide heatmap legend.
        name = "Z-score")                                 #Set title.

ht_selected = draw(ht_selected, heatmap_legend_side = "bottom") #Draws heatmap with legend at the bottom position.

#Output settings for each heatmap (W = width, H = height)
#2D_ECM: W = 3 x H = 3.5 | Output svg: W = 5 x H = 5
#2D_Secreted: W = 3 x H = 6.5 | Output svg: W = 4 x H = 4.5
#2D_Glycoproteins: W = 3 x H = 3.5 | Output svg: W = 4 x H = 4.5

svg(filename = "2D_Heatmap_Secreted.svg", width = 4, height = 4.5)
draw(ht_selected, heatmap_legend_side = "bottom")
dev.off()


In large Heatmaps, in order to avoid visual cluttering by labeling every gene, it is often valuable to label only a subset of genes.

#Heatmap parameters
col_fun <- colorRamp2(c(-2, 0, 2),                            #Create segments for the color gradient.
                      c("#2166AC", "white", "#B2182B"))       #Create color gradient with the specified color codes.

lgd_list <- list(title = "Z-score",                           #Set legend title.
                title_gp = gpar(fontsize = 12),               #Set font size for legend title.
                legend_height = unit(2.5, "cm"),              #Set height of the legend body.
                legend_width = unit(5, "cm"),                 #Set width of the legend body.
                grid_width = unit(0.4, "cm"),                 #Set width of each legend grid.
                direction = "horizontal",                     #Whether scale is 'horizontal' or 'vertical'
                title_position = "topcenter",                 #Set position of the title.
                gap = unit(4, "mm"))                          #Set gap between heatmap and legend.


#Create vector with subset of genes to label
Glycoproteins_select_label = c("ANGPT1", "B3GNT8", "MGAT4B", "EDNRB", "NDNF", "ADAMTS15", "ST6GALNAC5", "BMP4", "CDH8", "COL21A1", "IL34", "ICAM1", "NOTCH4", "OMD", "SEMA3D", "VCAM1")
Secreted_select_label = c("WNT2", "WNT16", "CXCL1", "CXCL2", "BMP4", "COL22A1", "EDN1", "GDF6", "IL1B", "CEMIP", "LIPG", "SCG5", "MXRA5", "IFNE", "SPP1", "ANGPT1", "NDNF", "KITLG", "AKR1B10")

#Assign genes subset to a static object for easily switching between plots
Genes_Label = Secreted_select_label

#Identify indices of the row names in the matrix that correspond to the genes listed in the vector 'Genes_Label' 
mark_at = which(rownames(vst_matSelected_z) %in% Genes_Label)

#Extract the row names of the matrix that correspond to indices in 'mark_at'. This step is a double-check to ensure that the selected rows do indeed match the original 'Genes_Label' vector.
Genes_DoubleCheck <- rownames(vst_matSelected_z)[mark_at]

#Check whether the two vectors contain the same elements, regardless of their order.
setequal(Genes_DoubleCheck, Genes_Label) #TRUE -> proceed to labeling.
#Note: Use 'Genes_DoubleCheck' object for labeling, because it outputs Gene symbols in the same order as 'mark_at'. This is necessary because 'mark_at' extracts row name numbers, which are automatically ordered numerically in the output vector.

#'anno_mark' function in ComplexHeatmap package enables linking heatmap annotation with customized labels.
foo = anno_mark(at = mark_at,                                 #Numeric index from original matrix.
                #link_width = unit(5, "mm"),                  #Uncomment to specify label link width.
                padding = 0.75,                               #Padding space between neighboring labels.
                labels = Genes_DoubleCheck,                   #Corresponding labels.
                labels_gp = gpar(fontsize = 10),              #Sets font size.
                which = "row")                                #Whether 'column' or 'row' annotation.

ht_selected_label <- Heatmap(vst_matSelected_z,               #Matrix input for heatmap.
                 col = col_fun,                               #Vector of colors for interpolation.
                 rect_gp = gpar(col = "grey30", lwd = 0.15),  #Border line color and width of each cell.
                 show_column_names = FALSE,                   #Whether to show column labeling.
                 show_row_names = FALSE,                      #Whether to show row labeling.
                 #row_names_gp = gpar(fontsize = 4),          #Uncomment to specify row label font size.
                 #row_names_side = "left",                    #Uncomment to specify row label side.
                 clustering_distance_columns = "pearson",     #Set clustering method for column dendrogram.
                 clustering_distance_rows = "pearson",        #Set clustering method for row dendrogram.
                 row_dend_reorder = FALSE,                    #Whether to apply row dendrogram reordering.
                 column_dend_reorder = TRUE,                  #Whether to apply column dendrogram reordering.
                 show_row_dend = FALSE,                       #Whether to show row dendrogram clustering.
                 show_column_dend = FALSE,                    #Whether to show column dendrogram clustering.
                 width = unit(3, "in"),                       #Set heatmap width.
                 height = unit(3.5, "in"),                    #Set heatmap height.
                 heatmap_legend_param = lgd_list,             #Reference object for legend parameters.
                 name = "Z-score")                            #Set title.

ht_merge = ht_selected_label + rowAnnotation(mark = foo)
ht_merge = draw(ht_merge, heatmap_legend_side = "bottom", align_heatmap_legend = "heatmap_left") #Note: 'align_heatmap_legend' = 'heatmap_center' position is offset because heatmap has been merged with 'rowAnnotation'.

svg(filename = "2D_Heatmap_Secreted_CompactLabel.svg", width = 4.5, height = 4.5)
dev.off()

---
title: "RNASeq analysis - R Pipeline"
author: "Pedro Lavrador"
output:
  html_notebook: default
  word_document: default
---

This pipeline was created using the following resources:

* [R](https://cran.r-project.org/bin/windows/base/)
* [RStudio](https://posit.co/download/rstudio-desktop/)

Within the RStudio file (**.rmd**), each code chunk can be executed by clicking the **Run** button within each chunk or by placing the cursor inside it and pressing **Cntrl+Shift+Enter**.

After installing *R* and *RStudio*, the following Bioconductor R packages are necessary for processing RNASeq raw data, conducting differential expression analysis and data visualization:

* [DESeq2](https://bioconductor.org/packages/release/bioc/html/DESeq2.html) - *Differential gene expression analysis*
* [edgeR](https://bioconductor.org/packages/release/bioc/html/edgeR.html) - *Differential gene expression analysis*
* [AnnotationDbi](https://bioconductor.org/packages/release/bioc/html/AnnotationDbi.html) - *Gene ID conversion*
* [org.Hs.eg.db](https://bioconductor.org/packages/release/data/annotation/html/org.Hs.eg.db.html) - *Genome annotation database for Human*
* [ComplexHeatmap](https://bioconductor.org/packages/release/bioc/html/ComplexHeatmap.html) - *Advanced heatmap data visualization*
* [EnhancedVolcano](https://bioconductor.org/packages/release/bioc/html/EnhancedVolcano.html) - *Advanced volcano plot data visualization*


For installing packages, execute the code on the **Installation** section in the corresponding Bioconductor page.

For instance, for installing **DESeq2**:

```{r Package Installation, message=FALSE, warning=FALSE}
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("DESeq2")
```


The remaining packages are part of the Comprehensive R Archive Network (CRAN), which is the official R packages repository. These can be installed in RStudio by going to **Tools** → **Install Packages** and in the **Install from** option select **Repository (CRAN)**, followed by specifying the intended packages:

* [ggplot2](https://cran.r-project.org/web/packages/ggplot2/index.html) - *Data visualization and export options*
* [ggalt](https://cran.r-project.org/web/packages/ggalt/index.html) - *Adds extra data visualization flexibility*
* [ggrepel](https://cran.r-project.org/web/packages/ggrepel/vignettes/ggrepel.html) - *Adds extra data visualization flexibility*
* [textshaping](https://cran.r-project.org/web/packages/textshaping/index.html) - *Necessary for EnhancedVolcano package*
* [tidyverse](https://cran.r-project.org/web/packages/tidyverse/index.html) - *Data processing and visualization*
* [RColorBrewer](https://cran.r-project.org/web/packages/RColorBrewer/index.html) - *Color schemes for data visualization*
* [circlize](https://cran.r-project.org/web/packages/circlize/index.html) - *Necessary for ColorRamp scheme in heatmap data visualization*
* [pheatmap](https://cran.r-project.org/web/packages/pheatmap/pheatmap.pdf) - *Basic heatmap data visualization*

<br>
After installing every required package, these libraries should be loaded into RStudio so that their commands and functions are available to use. To do so, enter:
```{r echo=T, results='hide'}
#Loading libraries
library("DESeq2")
library("edgeR")
library("AnnotationDbi")
library("org.Hs.eg.db")
library("ggplot2")
library("ggalt")
library("ggrepel")
library("textshaping")
library("tidyverse")
library("RColorBrewer")
library("circlize")
library("pheatmap")
library("ComplexHeatmap")
library("EnhancedVolcano")
```

**Note:** Libraries should always be re-loaded when re-opening the script.
<br>
<br>

Afterwards, open the raw count data file in RStudio. This file is a table that results from mapping the sequencing reads to a reference genome and displays the number of reads assigned to each gene, in each sample. This tab-delimited file contains a header row with each sample name and each row name represents a unique Ensembl gene ID (ENSG).

In this work, the file **"2D_counts.txt"** was used for differential gene expression analysis of **2D Native hASCs** vs **2D Glycoengineered hASCs**. For assessing the impact of cell-cell tethering in the transcriptome, **2D Glycoengineered hASCs** vs **3D Cellgels** samples were pooled into **"2Dv3D_counts.txt"**.

<br>
**For ease of comprehension, each function is described line-by-line in commented descriptions. For further information, please refer to each package's vignette containing the comprehensive command list.**

<br>
**First, RNASeq analysis in 2D conditions was conducted:**
```{r Read 2D counts}
#Reads count table and assigns it to a variable called "Counts".
Counts <- read.table("2D_counts.txt", header = TRUE, row.names = 1, sep ="\t")
Counts
```
**Note:** Before starting the analysis, **confirm that the raw counts table contains the actual unprocessed reads** and not the output of a previous DESeq2 workflow. Raw count values are stored as integers, whereas normalized counts are decimals numbers. DESeq2 differential expression analysis can only be conducted on raw unprocessed counts.

```{r Formats and saves 2D counts}
#Creates data frame called "coldata" that assigns experimental conditions ("2DControl" or "2DGly") to each sample and stores this factor in the "treatment" column. Also renames the original sample names in the .bam file to shorter and convenient IDs.
coldata <- DataFrame(id = c("2DC1", "2DC2", "2DC3", "2DC4", "2DC5", "2DG1", "2DG2", "2DG3", "2DG4", "2DG5"), treatment= factor(c(rep("2DControl", times = 5), rep("2DGly", times = 5)), levels = c("2DControl", "2DGly")))

#Saves the formatted counts table to a .txt file.
write.table(Counts, file = "2D_counts_v2.txt", sep = "\t", quote=FALSE, col.names = TRUE)
```

```{r Shows coldata}
coldata
```

<br>
Next, a **pre-filtering** strategy is applied to the **Counts** table. Before proceeding with differential expression analysis, it is important to filter out genes that consistently present extremely low expression. This reduces multiple testing burden and helps in increasing the statistical power of the analysis, while keeping genes of interest.

Herein, **counts per million (CPM)** are used for **pre-filtering** rather than **raw counts**, as **CPM** accounts for differences in **library sizes** (i.e., sequencing depth = total number of counts) across samples, ensuring a more robust comparison. *For instance, if samples yielded uneven library sizes (e.g. 15 M vs 40 M), a filtering criteria solely based on raw counts would not avoid favoring genes that are expressed in the larger library.* **CPM** is calculated through **edgeR** as the **raw counts divided by library size and multiplied by one million**.

**Note:** **CPM** calculation through **edgeR** package is more sophisticated than manually applying the formula as it also automatically corrects for library composition using *calcNormFactors*.

As a rule of thumb, for **library sizes** of about 20 M reads, a **CPM = 0.5** is used as it corresponds to a **raw count** of 10 - 15.

**Note:** To view the library size for each sample enter:
```{r Shows Library size, include=TRUE}
library_size <- colSums(Counts)
library_size.df <- as.data.frame(library_size)
rownames(library_size.df) <- coldata$id
library_size.df
```


```{r CPM Thresholding Check, results=FALSE}
#Cross-checking whether a CPM threshold of 0.5 corresponds to approximately 10 - 15 counts in these samples.
myCPM_colnames = c("C1", "C2", "C3", "C4", "C5", "G1", "G2", "G3", "G4", "G5")
myCPM <- cpm(Counts) #edgeR calculates the Counts Per Million (CPM) values for the Raw Counts object.
plot(myCPM[,1], Counts[,1], xlab="CPM", ylab="Raw Counts", main=paste("Sample", myCPM_colnames[1]), 
     ylim=c(0,50), xlim=c(0,3))
#Adds reference lines at x = 0.5 CPM and y = 10 Raw Counts
abline(v=0.5) + abline(h=10)
```

Considering library sizes in this dataset are on the order of 20 M, a threshold for genes with **CPM > 0.5** was established. An additional requirement for **expression in 3 or more libraries** is used, as each group contains 5 replicates. Applying a stringent **pre-filtering** threshold ensures a more reliable and robust analysis. This criteria eliminates genes that are unlikely to be relevant due to **(i)** inconsistent reads across different samples, and **(ii)** increased fold change and variance associated with extremely low counts. In addition, this setup ensures that a gene will be retained even if it is only expressed in one experimental condition.


```{r Pre-filtering, message=FALSE, results=FALSE}
#Pre-filtering Counts to CPM > 0.5 in at least 3 different samples
keepCPM <- rowSums(cpm(Counts) > 0.5) >= 3
CPMCounts <- Counts[keepCPM,]
nrow(Counts)    #61552 = Total number of mapped genes
nrow(CPMCounts) #14194 = Number of genes after pre-filtering
```

```{r Running DESeq2, message=FALSE, results=TRUE}
#Running DESeq2 analysis
ddsCPM <- DESeqDataSetFromMatrix(countData = CPMCounts, colData = coldata, design = ~treatment)
ddsCPM <- DESeq(ddsCPM)
resddsCPM <- results(ddsCPM)
resddsCPM #Show DESeq2 results table
```
```{r Output Normalized Counts}
#Output DESeq2 Normalized Counts - Necessary for submitting RNASeq datasets to the SRA repository.
normalized_counts <- counts(ddsCPM, normalized=TRUE)
write.table(normalized_counts, file = "2D_normalizedcounts.txt", sep = "\t", quote=FALSE, col.names = TRUE)
```

<br>
Next, to improve interpretability, it is useful to convert the mapped **ENSEMBL IDs** (e.g., ENSG00000115414) to the more recognizable **Gene Symbols** (e.g., FN1 for the Fibronectin gene). 

```{r Assign Gene IDs, message=FALSE, results=TRUE}
#Adds a new column "geneID" to the results table containing the corresponding Gene Symbols to each row
GeneIDs <- mapIds(org.Hs.eg.db,
              keys = row.names(resddsCPM),
              keytype = "ENSEMBL",
              column = "SYMBOL")
ENS_geneIDs <- stack(GeneIDs)
resddsCPM$geneID <- ENS_geneIDs$values
resddsCPM #Show the updated DESeq2 results table with the new geneID column
```
**Note:** Gene symbols ('geneID' column) with *NA* values are typically pseudogenes, some of those associated with producing antisense RNA transcripts. Not every ENSEMBL Gene ID corresponds to a HGNC Gene Symbol entry.

```{r Output Full Results}
#Output full results: Normalized counts + DESeq2 results table
results_sum <- cbind(normalized_counts, resddsCPM)
#write.csv(results_sum, file = "2D_full_results.csv", quote=FALSE, col.names = TRUE) #.CSV alternative to upload data
write.table(results_sum, file = "2D_full_results.txt", sep = "\t", quote=FALSE, col.names = TRUE) #.TXT for better data fetching
```

<br>
After DESeq2 processing, differential expression analysis of genes can be conducted. In this work, genes are considered to be **Differentially Expressed Genes (DEGs)** if they exhibit an **adjusted p-value < 0.05** (**padj**, Benjamini-Hochberg procedure for multiple testing correction) and present **log2(fold change) > 1** or **< -1**.

Specifically, a **log2(fold change) > 1** translates into an **> 2-fold upregulated gene expression**, whereas a **log2(fold change) < -1** indicates a **< 0.5 fold downregulation**. By combining both of these thresholds, this robust criteria provides greater specificity for identifying genes that are not only statistically significant, but are also more likely to be biologically relevant.


```{r Output DEGs, warning=FALSE, message=FALSE, results=FALSE}
#Analysis of DEGs distribution for Donut Plot

resddsCPM <- resddsCPM[!is.na(resddsCPM$padj), ] #Clean-up of genes containing 'NA' in 'padj' column. This occurs in genes filtered out by DESeq2 through automatic independent filtering or containing a sample with an extreme count outlier, as detected by Cook's distance.

sigs <- resddsCPM[resddsCPM$padj < 0.05,]       #Creates a subset of genes that meet the padj < 0.05 threshold.
sigsALL <- sigs[abs(sigs$log2FoldChange) > 1,]  #Total DEGs, use of absolute log2fold change values ('abs').
sigsUP <- sigs[sigs$log2FoldChange > 1,]        #Upregulated DEGs.
sigsDOWN <- sigs[sigs$log2FoldChange < -1,]     #Downregulated DEGs.

#Summary
#dim(resddsCPM) #Total genes = 14193
#dim(sigs)      #Total statistically significant genes = 4692
#dim(sigsALL)   #Total DEGs = 568
#dim(sigsUP)    #Upregulated DEGs = 180
#dim(sigsDOWN)  #Downregulated DEGs = 388

#Output DEGs
sigsALL.df <- as.data.frame(sigsALL)
sigsALL.df$symbol <- mapIds(org.Hs.eg.db, keys = rownames(sigsALL.df), keytype = "ENSEMBL", column = "SYMBOL") #Alternative way to add Gene Symbols to each row, available upon conversion to a data.frame object.
write.csv(sigsALL.df, "2D_DEGs.csv", quote = FALSE, col.names = TRUE)

#Output Volcano table
write.csv(resddsCPM, "2D_Volcano.csv", quote = FALSE, col.names = TRUE)
```

<br>
Next, the quality of the RNASeq dataset can be validated by plotting the **per-gene dispersion estimates versus the mean expression**. The visualization of this plot (**plotDispEsts**) allows for rapidly checking whether the **fitted mean-dispersion relationship** (*red trend line*) shows an **appropriate fit** to the data and ensure the **validity** of differential testing assumptions in downstream analysis.

```{r Plot Dispersions}
#Plot fitted mean-dispersion relationship
plotDispEsts(ddsCPM)
```

<br>
In addition, it is important to perform a scatterplot of the **biological coefficient of variation** (**BCV**) against the average abundance of each gene. A **BCV plot** provides further visualization of the **biological variability** across samples in terms of gene expression, and whether this variance is adequate for fitting.


As found below, the common dispersion (**Disp**) is **0.01331** and the **BCV** (square root of **Disp**) is **0.1154**. Typically, the asymptotic value for **BCV** tends to be within the **0.05 - 0.2** range for cell line experiments or genetically identical mice, whereas larger values (**> 0.3**) can be found in human subjects datasets. As a result, this **BCV** analysis further confirms the validity of the dataset for subsequent downstream analysis, as it will be demonstrated below.


```{r edgeR Processing and BCV Plot, message=FALSE,echo=TRUE, results='hide'}
#edgeR processing for fitting and calculations
condition = factor(c(rep("2DControl", times = 5), rep("2DGly", times = 5)))
dgeGlm <- DGEList(counts = CPMCounts, group = condition)
str(dgeGlm)
names(dgeGlm)
dgeGlm[["samples"]]
```

```{r BCV value and BCV Plot, echo=TRUE}
#Calculate common dispersion (Disp) and biological coefficient of variation (BCV)
design <- model.matrix(~condition)
dgeGlmComDisp <- estimateGLMCommonDisp(dgeGlm, design, verbose = TRUE) #Disp = 0.01331, BCV = 0.1154


#Plotting BCV scatterplot
dgeGlmTrendDisp <- estimateGLMTrendedDisp(dgeGlmComDisp, design)
dgeGlmTagDisp <- estimateGLMTagwiseDisp(dgeGlmTrendDisp, design)
plotBCV(dgeGlmTagDisp)
```


```{r EdgeR SmearPlot, include=FALSE}
#edgeR alternative to Volcano Plot
fit <- glmFit(dgeGlmTagDisp, design)
colnames(coef(fit))  
lrt <- glmLRT(fit, coef = 2)  
ttGlm <- topTags(lrt, n = Inf)  
summary(deGlm <- decideTestsDGE(lrt, p = 0.05, adjust = "fdr"))
tagsGlm <- rownames(dgeGlmTagDisp)[as.logical(deGlm)]
plotSmear(lrt, de.tags = tagsGlm)
```

<br>
RNASeq datasets often contain thousands of genes across multiple samples, making it extremely complex to detect and analyze patterns of gene expression data. In this context, **Principal Component Analysis** (**PCA**) transforms data into a set of uncorrelated variables (i.e., principal components), which is useful for reducing data dimensionality, providing a **blind** and **unbiased assessment** of gene expression patterns across multiple samples and conditions. PC1 describes the most variation within the data, PC2 the second most, and so forth.

Therefore, **PCA** is highly valuable for **(i)** identifying **gene expression trends** (e.g., confirming the biological relevance of the treatment/disease or experimental condition), **(ii)** identifying **clusters** of samples and their relationships, and **(iii)** screening out any potential **outliers** or **batch effects**. Visualizing these effects assists in confirming the hypothesis tested in further **DEGs** downstream analysis.


To create a **PCA plot**, first a **vst()** (*i.e., variance-stabilizing transformation*) function should be applied to remove the dependence of the variance on the mean. This is because analysis of multi-dimensional data (such as PCA) works best for homoskedastic data (i.e., consistent variability), whereas in RNASeq data variance increases with the mean values. Unlike the other DESeq2 function (i.e., *rlog()*) function for this purpose, **vst()** is more sensitive to size factor and also normalizes with respect to library size.


```{r 2D PCA}
#Output 2D PCA Plot
vsdata <- vst(ddsCPM, blind = FALSE) #Apply variance-stabilizing transformation to DESeq object 'ddsCPM'
#By default both these functions are **blind** to the sample design formula given to DESeq2 in DESeqDataSetFromMatrix(), however this is not appropriate if one expects large differences in counts explained by differences in experimental design. In such cases, the blind parameter should be set to 'FALSE' as recommended in DESeq2 vignette. 


#plotPCA(vsdata, intgroup = "treatment")          #Preview PC1 and PC2 % for correct axis labeling.
PCAraw <- plotPCA(vsdata, intgroup = "treatment") #Create a PCA plot using 'vsdata' and groups it by 'treatment'.
group.colors <- c("#619CFF","#00C19F")            #Vector containing color codes to be used in each group.
new_pca <- ggplot(PCAraw[["data"]],               #Create a ggplot using 'data' data.frame within 'PCAraw' object. 
  aes(PC1, PC2, color=group)) +                   #Construct aesthetic mapping using PC1 and PC2.
  scale_color_manual(values = group.colors) +     #Specify a vector of colors.
  geom_point(size = 2.5) +                        #Change marker size.
  xlab("PC1 (79%)") +                             #Text used for X axis.
  ylab("PC2 (9%)") +                              #Text used for Y axis.
  ylim(-10, 10) +                                 #Lower/upper limits for Y axis.
  xlim(-12, 12) +                                 #Lower/upper limits for X axis.
  theme_grey() +                                  #Set a grey background with white gridlines.
theme(axis.title.x = element_text(size = 14),     #Customize size and style of axis titles and text.
  axis.title.y = element_text(size = 14),
  axis.text.x = element_text(size = 12, face = "bold"),
  axis.text.y = element_text(size = 12, face = "bold"))

new_pca
ggsave("2D_PCA.svg", new_pca, dpi = 600, width = 2.6, height = 2, units = "in", scale = 2) #Exports plot in .svg
```
<br>
Apart from **PCA**, another important analysis to perform is to determine whether **intergroup variability** - which represents differences between experimental conditions in comparison with control conditions - is greater than the **intragroup variability**, representing technical or biological variance. This is achieved by **calculating distance** as represented by correlation across samples, typically using the **Pearson's correlation plot**. 

**Pearson's** coefficient reflects the linear relationship between 2 variables accounting for differences in mean and standard deviation. For instance, the **more similar the gene expression profiles for all transcripts** are between two samples, the **higher the correlation coefficient** will be. These coefficients are calculated between all samples and are then **visualized as a heatmap**.

Similar to PCA, the **Pearson's correlation plot** provides an overview of the **overall strength of the biological effect** of the experiment, assessing whether replicates and different conditions group together, as well as identify outliers that were not excluded during upstream steps (e.g., sample labeling or genomic alignment).



```{r 2D Sample Distances Correlation}
#Output 2D Pearson's Correlation Plot
vsdata <- vst(ddsCPM, blind = FALSE)                    #Apply variance-stabilizing transformation.
sampleDists <- cor(method = "pearson", assay(vsdata))   #Calculate Pearson correlation coefficients.
sampleDistMatrix <- as.matrix(sampleDists)              #Convert to matrix format.
rownames(sampleDistMatrix) <- paste(vsdata$id)          #Label Sample names.
colnames(sampleDistMatrix) <- paste(vsdata$id)          #Label Sample names.
colors <- colorRampPalette(brewer.pal(9, "Blues"))(255) #Color palette range for heatmap visualization.
SampleDistance = pheatmap(sampleDistMatrix, col=colors) #Generate heatmap with 'sampleDistMatrix' and 'colors' inputs.
SampleDistance #ggsave("SD_2D_Pearson.svg", SampleDistance, dpi = 600, width = 2.55, height = 2, units = "in", scale = 2) 
```

```{r Euclidean Distances Correlation DESeq2 vignette}
#Alternative: Euclidean Distances Correlation Plot, as described in DESeq2 vignette
sampleDists_Euclidean <- dist(t(assay(vsdata)))                 #Calculate Pearson correlation coefficients.
sampleDistMatrix_Euclidean <- as.matrix(sampleDists_Euclidean)  #Convert to matrix format.
rownames(sampleDistMatrix_Euclidean) <- paste(vsdata$id)        #Label Sample names.
colnames(sampleDistMatrix_Euclidean) <- paste(vsdata$id)        #Label Sample names.
colors <- colorRampPalette(rev(brewer.pal(9, "Blues")))(255)    #Color palette range for heatmap visualization.
SampleDistancevignette = pheatmap(sampleDistMatrix_Euclidean,   #Generate heatmap with 'sampleDistMatrix' and 'colors' inputs.
         clustering_distance_rows=sampleDists_Euclidean,        #Clustering by Euclidean distance.
         clustering_distance_cols=sampleDists_Euclidean,        #Clustering by Euclidean distance.
         col=colors)
SampleDistancevignette #ggsave("2D_SD.svg", SampleDistancevignette, dpi = 600, width = 2.55, height = 2, units = "in", scale = 2)
```
<br>
<br>
Next, a **Volcano plot** can be assembled to provide a general overview of the gene expression profiles of 2D Native hASCs vs 2D Glycoengineered hASCs. **Volcano plots** provide a clear and intuitive visualization of **statistical significance** and the **extent of fold changes** for every gene detected. Here, each gene is plotted with their statistical significance (adjusted p-values on the y-axis) agaisnt the fold change (log2 fold change on the x-axis). 

Then, a **threshold** is set to highlight genes that are differentially expressed (**DEGs**), as represented in the plot in the form of dashed lines. As aforementioned, this work considers as **DEGs** genes that present an **adjusted p-value < 0.05** and **log2(fold change) > 1** or **< -1**. This **combination of significance and fold change** information allows researchers to efficiently **identify** and **prioritize** **genes of interest** for further investigation. In addition, **Volcano plots** can also assist in screening out genes with high statistical significance but low fold changes, which may be significant due to large sample sizes but may not have substantial biological relevance.

```{r 2D Volcano plot, warning=FALSE, message=FALSE}
#Output 2D Volcano plot with specific Genes of Interest labelled

resddsCPM.df <- as.data.frame(resddsCPM)                        #Convert DESeq2 results object to a data frame.

#Create custom key-value pairs for 'Up', 'Down', 'Non-DEG' expression. Achieved using nested 'ifelse' statements.
keyvals <-  ifelse(resddsCPM$padj > 0.05, 'grey30',             #If padj > 0.05 assign grey color.
            ifelse(resddsCPM$log2FoldChange < -1, 'royalblue',  #If log2foldchange < -1, assign blue color.
            ifelse(resddsCPM$log2FoldChange > 1, 'firebrick3',  #If log2foldchange > 1, assign red color.
                                                  'grey30')))   #Other values assigned grey color.
#Rename elements in 'keyvals' vector:
names(keyvals)[keyvals == 'firebrick3'] <- 'Up'                 #Assign 'Up' to elements in red.
names(keyvals)[keyvals == 'grey30'] <- 'Non-DEG'                #Assign 'Non-DEG' to elements in red.
names(keyvals)[keyvals == 'royalblue'] <- 'Down'                #Assign 'Up' to elements in red.

#Define vector containing Gene Symbols to highlight in plot
selected = c("PTGS2", "KRT19", "CXCL1", "SLC7A11", "NDNF", "PGD", "VCAM1", "GPC1", "ANGPT1", "GDF15")

#Volcano plot settings
enhvolc <- EnhancedVolcano(resddsCPM.df,                        #Create Volcano plot using 'resddsCPM.df'
        x = "log2FoldChange",                                   #Column name containing log2(fold change) values.
        y = "padj",                                             #Column name containing adjusted p-value values.
        #lab = NA,                                              #Uncomment this function for unlabeled genes.
        pCutoff = 0.05,                                         #Cut-off threshold for statistical significance.
        FCcutoff = 1,                                           #Cut-off threshold for absolute log2(fold change).
        title = NULL,                                           #Plot title.
        subtitle = NULL,                                        #Plot subtitle.
        caption = NULL,                                         #Plot caption.
        xlab = bquote(~Log[2]~"(Fold Change)"),                 #Label for x-axis. 
        ylab = bquote(~-Log[10] ~ "(" * italic(P[adj]) * ")"),  #Label for y-axis.
        colAlpha = 0.5,                                         #Alpha for color transparency.
        colCustom = keyvals,                                    #Override color scheme (e.g. color by pathway, cell-type or condition).
        pointSize = 1,                                          #Size of plotted points.
        labSize = 4,                                            #Size of labels for each point.
        drawConnectors = TRUE,                                  #Connect labels to each corresponding point by line connectors.
        widthConnectors = 0.65,                                 #Line width of connectors.
        arrowheads = FALSE,                                     #Draw arrowheads instead of simple line connectors.
        boxedLabels = TRUE,                                     #Draw labels within boxes.
        directionConnectors = 'both',                           #Direction to draw connectors (e.g., 'y', 'x' or 'both').
        xlim = c(-6, 6),                                        #Limits of x-axis.
        ylim = c(0, -log10(10e-255)),                           #Limits of y-axis.
#Exclude these if not highlighting specific genes
        lab = resddsCPM.df$geneID,                              #Input Gene Symbol column of data.frame object.
        selectLab = selected                                    #Vector containing a subset of genes of interest.
        )

#Customize axis limits and breaks
p1 <-  enhvolc +
    ggplot2::coord_cartesian(xlim = c(-6.4, 6.4),               #Zoom plot to the specific x-axis limit. 
                             ylim = c(0, 250)) +                #Zoom plot to the specific y-axis limit. 
    ggplot2::scale_x_continuous(breaks=seq(-6, 6, 2)) +         #Set spacing of breaks in plot x-axis.
    ggplot2::scale_y_continuous(breaks=seq(0, 250, 50))         #Set spacing of breaks in plot y-axis.
p1
ggsave("2D_Volcano_Wide_Label.svg", p1, width = 4.5, height = 2.8, units = "in", scale = 2)
```
<br>
Alternatively, **Volcano plots** can be customized via other packages (e.g., *'ggplot2'*) as subplot overlays for presenting a list of target genes with different shading specifications.

For instance, **overlaying** all genes that **match a specific requirement** (i.e., all ILs genes).

```{r ggplot2 Volcano, include=FALSE, echo=FALSE}
#Alternative Volcano plot flexibility through 'ggplot2' package. Note: doesn't have EnhancedVolcano flexibility for correcting p-values = 0.

resddsCPMVOLC.df <- resddsCPM.df
#Add a column to the data frame to specify if genes are UP- or DOWN- regulated DEGs
resddsCPMVOLC.df$diffexpressed <- "Non-DEG"
#UPREGULATED DEGs
resddsCPMVOLC.df$diffexpressed[resddsCPMVOLC.df$log2FoldChange > 1 & resddsCPMVOLC.df$padj < 0.05] <- "Upregulated"
#DOWNREGULATED DEGs
resddsCPMVOLC.df$diffexpressed[resddsCPMVOLC.df$log2FoldChange < -1 & resddsCPMVOLC.df$padj < 0.05] <- "Downregulated"

#Volcano 'ggplot2' theme
volc_theme <- theme_set(theme_classic(base_size = 20) +
            theme(
              axis.title.y = element_text(face = "bold", margin = margin(0,10,0,0), size = 15, color = 'black'),
              axis.title.x = element_text(hjust = 0.5, face = "bold", margin = margin(10,0,0,0), size = 15, color = 'black'),
              plot.title = element_text(hjust = 0.5)
            ))

ggplot(data = resddsCPMVOLC.df,
       aes(x = log2FoldChange,
           y = -log10(padj),
       col = diffexpressed)) +
       geom_point(size = 1) + 
       geom_vline(xintercept = c(-1, 1), col = 'grey30', linetype = 'dashed') +
       geom_hline(yintercept = c(-log10(0.05)), col = 'grey30', linetype = 'dashed') +
       scale_color_manual(values = c("royalblue", "grey30", "firebrick3"), #Set color vectors for gene classifications
           labels = c("Downregulated", "Non-DEGs", "Upregulated")) + #Set label names to overwrite text in data.frame (UP, DOWN, NO)
       coord_cartesian(ylim = c(0, 300), xlim = c(-11, 11)) + #Set limits due to padj = 0 on some genes, they will still be plotted but collapsed on edge
       labs(color = 'Legend', #Legend title, 
           x = expression("Log"[2]*"(Fold Change)"),
           y = expression("-Log"[10]*P[adj])) + 
           scale_x_continuous(breaks = c(seq(-10, 10, 5)), #Customize x-axis ticks
           limits = c(-20, 20)) + #Show datapoints within this region
volc_theme
```

```{r ggplot2 Volcano settings, include=FALSE}
resddsCPMVOLC.df %>%
  distinct(diffexpressed) %>%
  pull()  
# "NO"  "UP"  "DOWN"

cols <- c("Downregulated" = "royalblue", "Upregulated" = "firebrick3", "Non-DEG" = "grey30") 
sizes <- c("Downregulated" = 1, "Upregulated" = 1, "Non-DEG" = 1) 
alphas <- c("Downregulated" = 0.5, "Upregulated" = 0.5, "Non-DEG" = 0.5)
# "#ffad73", "down" = "#26b3ff"

ggplotVolc <- ggplot(data = resddsCPMVOLC.df, aes(x = log2FoldChange, y = -log10(padj),
  fill = diffexpressed,
  size = diffexpressed,
  alpha = diffexpressed)) +
  geom_point(shape = 21,
             colour = "black") + 
  geom_vline(xintercept = c(-1, 1), col = 'grey30', linetype = 'dashed') +
  geom_hline(yintercept = c(-log10(0.05)), col = 'grey30', linetype = 'dashed') +
  scale_fill_manual(values = cols) + # Modify point color
  scale_size_manual(values = sizes) + # Modify point size
  scale_alpha_manual(values = alphas) + # Modify point transparency
  scale_x_continuous(breaks = c(seq(-10, 10, 5)),       
                     limits = c(-10, 10)) +
  labs(color = 'Legend', #legend_title, 
       x = expression("log"[2]*"Fold Change"), y = expression("-log"[10]*P[adj]))

ggplotVolc
#ggsave("2Dvs3D_ggplotVolc.svg", ggplotVolc, width = 3.8, height = 2.5, units = "in", scale = 2)
```

```{r GGPLOT Volcano Overlay and Specific Labeling 2, echo=FALSE}
#--------------OVERLAY TARGET GENES ON GREYED OUT VOLCANO PLOTS-----------------------
# Add subplot layer to the main volcano plot -----------------------------------
ils <- str_subset(resddsCPMVOLC.df$geneID, "^IL") #"^(I|L){2}[0-9]+$")  #This fetches gene symbols starting with IL in the name.
#length(ils) #52

il_genes <- resddsCPMVOLC.df %>%
  filter(geneID %in% ils)

ggplotVolc_sublayer <- ggplot(data = resddsCPMVOLC.df, # Original data  
       aes(x = log2FoldChange, y = -log10(padj))) + 
  geom_point(colour = "grey", alpha = 0.5) +
  geom_point(data = il_genes, # New layer containing data subset il_genes       
             size = 2,
             shape = 21,
             fill = "firebrick",
             colour = "black") +
  geom_vline(xintercept = c(-1, 1), col = 'grey30', linetype = 'dashed') +
  geom_hline(yintercept = c(-log10(0.05)), col = 'grey30', linetype = 'dashed') +
  labs(color = 'Legend', #legend_title, 
       x = expression("log"[2]*"(Fold Change)"), y = expression("-log"[10]*P[adj])) + 
  scale_x_continuous(breaks = c(seq(-10, 10, 5)), #Customize X axis ticks
                     limits = c(-10, 10)) + # Show datapoints within this region
  volc_theme

ggplotVolc_sublayer
#ggsave("2D_ggplotVolc_sub.svg", ggplotVolc_sublayer, width = 3.8, height = 3, units = "in", scale = 2)
```
<br>
In addition, it is also possible to assemble a **Volcano plot** containing an **overlay** of genes of interest on top.

![](Treated/2Dvs3D_ggplotVolc_labels.svg)

```{r GGPLOT Volcano Overlay and Specific Labeling 3, include=FALSE}
#-----------------OVERLAY LABELED GENES------------------------------------
Selected_genes <- resddsCPMVOLC.df %>%
  filter(geneID %in% c("IL32", "IL34", "IL1B", "COL7A1", "COL10A1", "MMP13", "LAMC2", "LAMB3", "COL24A1", "LAMA1", "VIT"))
UP_Selected_genes <- resddsCPMVOLC.df %>%
  filter(geneID %in% c("IL32", "IL34", "IL1B", "COL7A1", "COL10A1", "MMP13", "LAMC2", "LAMB3"))
DOWN_Selected_genes <- resddsCPMVOLC.df %>%
  filter(geneID %in% c("COL24A1", "LAMA1", "VIT"))

ggplotVolc_labels <- ggplot(data = resddsCPMVOLC.df,
       aes(x = log2FoldChange, y = -log10(padj))) + 
       geom_point(aes(colour = diffexpressed), 
             alpha = 0.2, 
             shape = 16,
             size = 1) + 
  geom_point(data = UP_Selected_genes,
             shape = 21,
             size = 2, 
             fill = "steelblue", 
             colour = "black") + 
  geom_point(data = DOWN_Selected_genes,
             shape = 21,
             size = 2, 
             fill = "firebrick3", 
             colour = "black") + 
   geom_hline(yintercept = -log10(0.05),
             linetype = "dashed") + 
  geom_vline(xintercept = c(log2(0.5), log2(2)),
             linetype = "dashed") +
  geom_label_repel(data = Selected_genes, # Add labels last to appear as the top layer  
                   aes(label = geneID),
                   force = 2,
                   nudge_y = 3,
                   #direction = "y",
                   size = 2,
                   min.segment.length = unit(0, "lines"),
                   label.padding = unit(0.2, "lines")) +
  scale_colour_manual(values = cols) + 
  scale_x_continuous(breaks = c(seq(-10, 10, 2)),     
                     limits = c(-10, 10)) +
  volc_theme

ggplotVolc_labels
#ggsave("2Dvs3D_ggplotVolc_labels.svg", ggplotVolc_labels, width = 3.8, height = 2.5, units = "in", scale = 2)
```

<br>
Alternatively, it is possible to assemble a **MA plot**, which conveys a similar overview to **Volcano plots**. However, this plot lacks the design flexibility that Volcano plots offer through their dedicated R packages.

```{r Alternative 2D MA PLOT}
#Alternative to Volcano - MA Plot. Less flexibility in customizing the final figure.
MAPlot_dds <- DESeq2::plotMA(resddsCPM, alpha = 0.05, #Set p-value cut-off threshold.
               colNonSig = "gray30",                  #Set color code for Non-DEG.
               colSig = "royalblue3",                 #Set color code for statistically significant.
               ylim = c(-7, 7))                       #Set y-axis limit.
abline(h = c(1, -1), lty = 2)                         #Draw horizontal lines at the specified y-axis values, line type - dashed.
```
<br>
Apart from **Volcano plots**, **Heatmaps** provide an important visualization tool in RNASeq for detecting gene expression trends and highlighting genes with relevant biological functions. Herein, each sample is assigned to a column, and each row corresponds to a **DEG** with **color-coded expression level**. Expression values are typically normalized and transformed (i.e., **row z-scaling**) to ensure that genes with varying expression levels can be compared easily. **Heatmaps** are usually paired with **hierarchical clustering** algorithms (e.g., Pearson's) that group together genes and samples with similar expression profiles, generating a hierarchical tree-like dendrogram. **Heatmaps** are thus useful in RNASeq downstream analysis for revealing **co-expression patterns** and **biological functions** affected by experimental conditions.


```{r Heatmap setup, message=FALSE, warning=FALSE}
#Transformation of data for Heatmap
vsdata <- vst(ddsCPM, blind = FALSE)        #Apply variance-stabilizing transformation.
vst_mat <- assay(vsdata)                    #Create matrix with transformed gene expression values.
vst_mat <- vst_mat[rownames(sigsALL.df), ]  #Subset matrix to include only DEGS, i.e., rows that match genes listed in 'sigsALL.df'.
Genename <- mapIds(org.Hs.eg.db,            #Map Ensembl ID from matrix to Gene Symbol and store in variable 'Genename'.
               keys = row.names(vst_mat),
              keytype = "ENSEMBL",
              column = "SYMBOL")

#Replace NA gene symbols with ENSEMBL IDs in vst_mat.
#(i) Identify which entries in 'Genename' have missing values (NA).
no_symbol_idx <- is.na(Genename)

#(ii) For 'Genename' entries with NA, replace values with original Ensembl ID extracted from 'vst_mat' row names.
Genename[no_symbol_idx] <- row.names(vst_mat)[no_symbol_idx]

#(iii) Convert 'Genename' vector into a single column data.frame, making it suitable for assignment to row names of 'vst_mat'
ENS_geneIDs <- stack(Genename)       

#(iv) Replace Ensembl ID entries with Gene Symbols when available, and retains original Ensembl ID if NA.
row.names(vst_mat) <- ENS_geneIDs$values                      


#Double-check that no DEGs were lost during assignment
#dim(vst_mat) #568 DEGs -> TRUE

#Scale genes z-score by row, as otherwise large values from highly expressed genes would dominate the plot
vst_mat_z <- t(scale(t(vst_mat)))

#Assign column names to each sample
colnames(vst_mat_z) <- c("2DC1", "2DC2", "2DC3", "2DC4", "2DC5", "2DG1", "2DG2", "2DG3", "2DG4", "2DG5")

#Heatmap parameters
col_fun <- colorRamp2(c(-2, 0, 2),                        #Create segments for the color gradient.
                c("#2166AC", "white", "#B2182B"))   #Create color gradient with the specified color codes.

lgd_list <- list(title = "Z-score",                       #Set legend title.
                legend_height = unit(2.5, "cm"),          #Set height of the legend body.
                grid_width = unit(0.4, "cm"),             #Set width of each legend grid.
                title_position = "topcenter",             #Set position of the title.
                gap = unit(4, "mm"))                      #Set gap between heatmap and legend.
```


```{r Output Heatmap, echo=TRUE, results='hide'}
#Output full DEGs Heatmap
ht <- Heatmap(vst_mat_z,                                        #Matrix input for heatmap.
        col = col_fun,                                          #Set vector of colors for interpolation.
        show_column_names = FALSE,                              #Whether to show column labeling. 
        show_row_names = FALSE,                                 #Whether to show row labeling.
        #row_labels = sigs.df[rownames(vst_mat_z),]$Symbol,     #Uncomment to specify row labeling. 
        row_names_gp = gpar(fontsize = 5),                      #Set font size for row labeling.
        clustering_distance_columns = "pearson",                #Set clustering method for column dendrogram.
        clustering_distance_rows = "pearson",                   #Set clustering method for row dendrogram.
        #column_labels = colnames(vst_mat_z),                   #Uncomment for specifying column labeling. 
        column_names_gp = gpar(fontsize = 5),                   #Set font size for column labeling.
        row_dend_reorder = FALSE,                               #Whether to apply row dendrogram reordering. 
        column_dend_reorder = TRUE,                             #Whether to apply column dendrogram reordering. 
        show_row_dend = TRUE,                                   #Whether to show row dendrogram clustering.
        row_dend_width = unit(10, "mm"),                        #Set width of row dendrograms.
        show_column_dend = FALSE,                               #Whether to show column dendrogram clustering.
        width = unit(2, "in"),                                  #Set heatmap width.
        height = unit(10, "in"),                                #Set heatmap height.
        name = "Z-score")                                       #Set title.
#ht
#tiff("2D_Full_Heatmap.tiff", width = 4, height = 10, units = "in", res=600)
#draw(ht)
#dev.off()
```

![](Treated/2D_Full_Heatmap.svg)



<br>
<br>
Next, it becomes highly valuable to zoom in on **specific gene subsets** that are involved in relevant **biological processes**. Herein, DEGs related to **extracellular matrix organization**, **glycoproteins** and **secreted factors** are mapped to unveil the impact of the cell surface engineering process on the transcriptome.


```{r Heatmap Selected Subset of Genes, message=FALSE, warning=FALSE, results='hide', fig.keep='all'}
#Create vectors of gene lists
Genelist_ECM = c("ADAMTS15", "WNT16", "WNT2", "WNT9A", "ANGPTL4", "BMP4", "MMRN2", "COL21A1", "COL22A1", "OMD", "TNXB", "THSD4")
Genelist_Glycoproteins = c("ADAMTS15", "APCDD1L", "ABCA6", "ABCA8", "ABCA9", "ABCB6", "ABCC2", "ABCC3", "ATP6V0E2", "BRINP1", "CEBPB", "CD200", "EPHA5", "FAM20A", "GPR137B", "GPR68", "GPRC5A", "GPRC5B", "KCNJ5-AS1", "KITLG", "KIT", "LIF", "LYPD1", "MPL", "NOX4", "SHANK2", "SLITRK4", "ST6GALNAC5", "ST6GAL2", "B3GNT8", "WNT16", "WNT2", "WNT9A", "ADCY4", "AFP", "MGAT4B", "AREG", "ANGPT1", "ANGPTL4", "AGTR1", "AQP1", "ACKR3", "B3GALT2", "B4GALNT1", "BMP4", "CDH18", "CDH6", "CDH8", "CHST6", "CDON", "CEMIP", "CHRM2", "CSGALNACT1", "COL21A1", "COL22A1", "C1RL", "CFHR1", "CNNM4", "ENPP4", "ELAPOR1", "ESM1", "EDNRB", "EFNA5", "EPOR", "GRPR", "GPC1", "GDF15", "GDF6", "HHIP", "HGF", "HMMR", "HSD11B1", "IGSF10", "IGFBP3", "ICAM1", "IFNE", "IL1RL1", "IL20RA", "IL34", "JUP", "LRRN3", "LIPG", "ME1", "MXRA5", "MMRN2", "NDNF", "NETO2", "NOTCH4", "OMD", "PLAT", "PDGFRL", "PLXDC1", "GALNT15", "GALNT3", "POPDC3", "KCNK1", "PCSK6", "PCSK9", "PTGDR", "PTGER3", "PTGS2", "PTPRQ", "PCDHB5", "RAET1E", "RNF128", "RNF150", "SPP1", "SEMA3C", "SEMA3D", "SEMA6D", "PRSS35", "SNAI1", "SLC1A2", "SLC12A9", "SLC14A1", "SLC2A12", "SLC2A5", "SLC24A3", "SLC38A1", "SLC38A4", "SLC4A8", "SLC7A11", "SLC7A14", "TNXB", "TENM4", "THBD", "TRPC6", "TMEM130", "VCAM1")
Genelist_Secreted = c("ADAMTS15", "ABCB6", "CCL26", "CXCL1", "CXCL2", "CXCL3", "CXCL6", "CXCL8", "C1QTNF2", "FAM20A", "HTRA3", "KCNJ5-AS1", "KITLG", "LIF", "WNT16", "WNT2", "WNT9A", "AKR1B10", "AFP", "ANGPT1", "ANGPTL4", "BMP4", "CEMIP", "COL21A1", "COL22A1", "C1RL", "CFHR1", "ESM1", "EDN1", "EPOR", "FABP5", "GPC1", "GDF15", "GDF6", "HHIP", "IGSF10", "IGFBP3", "IFNE", "IL1B", "IL1RL1", "IL34", "LIPG", "MXRA5", "MMRN2", "NDNF", "OMD", "PINLYP", "PLAT", "PDGFRL", "PLXDC1", "PCSK6", "PCSK9", "RAET1E", "SPP1", "SCG5", "SEMA3C", "SEMA3D", "PRSS35", "TNXB", "THSD4")

#Replace here [Genelist_ECM] with the name of the vector for specific genes c("XX", "XY", "XZ")
selected_DEGs <- Genelist_Secreted

#Filters the matrix to retain only the rows that correspond to the genes listed in 'selected_DEGs' vector.
vst_matSelected <- vst_mat[rownames(vst_mat) %in% selected_DEGs, ]

#Double check to ensure that all selected DEGs were retrieved
#dim(vst_mat)                    #568 DEGs
#dim(vst_matSelected)            #60 DEGs
#length(Genelist_Secreted)       #60 DEGs

#Apply row z-scaling
vst_matSelected_z <- t(scale(t(vst_matSelected)))

#Assign column names to each sample
colnames(vst_matSelected_z) <- c("2DC1", "2DC2", "2DC3", "2DC4", "2DC5", "2DG1", "2DG2", "2DG3", "2DG4", "2DG5")

#Heatmap parameters
col_fun <- colorRamp2(c(-2, 0, 2),                        #Create segments for the color gradient.
                      c("#2166AC", "white", "#B2182B"))   #Create color gradient with the specified color codes.

lgd_list <- list(title = "Z-score",                       #Set legend title.
                title_gp = gpar(fontsize = 12),           #Set font size for legend title.
                legend_height = unit(2.5, "cm"),          #Set height of the legend body.
                legend_width = unit(5, "cm"),             #Set width of the legend body.
                grid_width = unit(0.4, "cm"),             #Set width of each legend grid.
                direction = "horizontal",                 #Whether scale is 'horizontal' or 'vertical'
                title_position = "topcenter",             #Set position of the title.
                gap = unit(4, "mm"))                      #Set gap between heatmap and legend.


ht_selected <- Heatmap(vst_matSelected_z,                 #Matrix input for heatmap.
        col = col_fun,                                    #Vector of colors for interpolation.
        rect_gp = gpar(col = "grey30", lwd = 0.15),       #Border line color and width of each cell.
        show_column_names = FALSE,                        #Whether to show column labeling.
        show_row_names = TRUE,                            #Whether to show row labeling.
        row_names_gp = gpar(fontsize = 5),                #Set font size for row labeling.
        clustering_distance_columns = "pearson",          #Set clustering method for column dendrogram.
        clustering_distance_rows = "pearson",             #Set clustering method for row dendrogram.
        #column_labels = colnames(vst_matSelected_z),     #Uncomment to specify column labeling.
        column_names_gp = gpar(fontsize = 5),             #Set font size for column labeling.
        row_dend_reorder = FALSE,                         #Whether to apply row dendrogram reordering.
        column_dend_reorder = TRUE,                       #Whether to apply column dendrogram reordering. 
        show_row_dend = FALSE,                            #Whether to show row dendrogram clustering.
        row_dend_width = unit(10, "mm"),                  #Set width of row dendrograms.
        show_column_dend = FALSE,                         #Whether to show column dendrogram clustering.
        width = unit(3, "in"),                            #Set heatmap width.
        height = unit(3.5, "in"),                         #Set heatmap height.
        heatmap_legend_param = lgd_list,                  #Reference object for legend parameters.
        #show_heatmap_legend = FALSE,                     #Uncomment to hide heatmap legend.
        name = "Z-score")                                 #Set title.

ht_selected = draw(ht_selected, heatmap_legend_side = "bottom") #Draws heatmap with legend at the bottom position.

#Output settings for each heatmap (W = width, H = height)
#2D_ECM: W = 3 x H = 3.5 | Output svg: W = 5 x H = 5
#2D_Secreted: W = 3 x H = 6.5 | Output svg: W = 4 x H = 4.5
#2D_Glycoproteins: W = 3 x H = 3.5 | Output svg: W = 4 x H = 4.5

svg(filename = "2D_Heatmap_Secreted.svg", width = 4, height = 4.5)
draw(ht_selected, heatmap_legend_side = "bottom")
dev.off()
```


```{r rowAnnotation to be used within right_annotation not working yet, include=FALSE}
#ha = rowAnnotation(foo = anno_mark(at = mark_at, labels = Genes_DoubleCheck,
                  #link_width = unit(5, "mm"),
                  #padding = 0.5,
#                  labels_gp = gpar(fontsize = 10
                     #, fontface = "bold"  
#                                   )))
#ha = rowAnnotation(foo = anno_mark(at = which(rownames(vst_matSelected_z) %in% Genes_Label),
#                                    labels = rownames(vst_matSelected_z)[rownames(vst_matSelected_z)%in%Genes_Label],
#                                   which = "row"))
```

<br>
In large **Heatmaps**, in order to avoid visual cluttering by labeling every gene, it is often valuable to label only a subset of genes.

```{r Heatmap Specific Gene Labeling, message=FALSE, warning=FALSE, results='hide', fig.keep='last'}
#Heatmap parameters
col_fun <- colorRamp2(c(-2, 0, 2),                            #Create segments for the color gradient.
                      c("#2166AC", "white", "#B2182B"))       #Create color gradient with the specified color codes.

lgd_list <- list(title = "Z-score",                           #Set legend title.
                title_gp = gpar(fontsize = 12),               #Set font size for legend title.
                legend_height = unit(2.5, "cm"),              #Set height of the legend body.
                legend_width = unit(5, "cm"),                 #Set width of the legend body.
                grid_width = unit(0.4, "cm"),                 #Set width of each legend grid.
                direction = "horizontal",                     #Whether scale is 'horizontal' or 'vertical'
                title_position = "topcenter",                 #Set position of the title.
                gap = unit(4, "mm"))                          #Set gap between heatmap and legend.


#Create vector with subset of genes to label
Glycoproteins_select_label = c("ANGPT1", "B3GNT8", "MGAT4B", "EDNRB", "NDNF", "ADAMTS15", "ST6GALNAC5", "BMP4", "CDH8", "COL21A1", "IL34", "ICAM1", "NOTCH4", "OMD", "SEMA3D", "VCAM1")
Secreted_select_label = c("WNT2", "WNT16", "CXCL1", "CXCL2", "BMP4", "COL22A1", "EDN1", "GDF6", "IL1B", "CEMIP", "LIPG", "SCG5", "MXRA5", "IFNE", "SPP1", "ANGPT1", "NDNF", "KITLG", "AKR1B10")

#Assign genes subset to a static object for easily switching between plots
Genes_Label = Secreted_select_label

#Identify indices of the row names in the matrix that correspond to the genes listed in the vector 'Genes_Label' 
mark_at = which(rownames(vst_matSelected_z) %in% Genes_Label)

#Extract the row names of the matrix that correspond to indices in 'mark_at'. This step is a double-check to ensure that the selected rows do indeed match the original 'Genes_Label' vector.
Genes_DoubleCheck <- rownames(vst_matSelected_z)[mark_at]

#Check whether the two vectors contain the same elements, regardless of their order.
setequal(Genes_DoubleCheck, Genes_Label) #TRUE -> proceed to labeling.

#Note: Use 'Genes_DoubleCheck' object for labeling, because it outputs Gene symbols in the same order as 'mark_at'. This is necessary because 'mark_at' extracts row name numbers, which are automatically ordered numerically in the output vector.

#'anno_mark' function in ComplexHeatmap package enables linking heatmap annotation with customized labels.
foo = anno_mark(at = mark_at,                                 #Numeric index from original matrix.
                #link_width = unit(5, "mm"),                  #Uncomment to specify label link width.
                padding = 0.75,                               #Padding space between neighboring labels.
                labels = Genes_DoubleCheck,                   #Corresponding labels.
                labels_gp = gpar(fontsize = 10),              #Sets font size.
                which = "row")                                #Whether 'column' or 'row' annotation.

ht_selected_label <- Heatmap(vst_matSelected_z,               #Matrix input for heatmap.
                 col = col_fun,                               #Vector of colors for interpolation.
                 rect_gp = gpar(col = "grey30", lwd = 0.15),  #Border line color and width of each cell.
                 show_column_names = FALSE,                   #Whether to show column labeling.
                 show_row_names = FALSE,                      #Whether to show row labeling.
                 #row_names_gp = gpar(fontsize = 4),          #Uncomment to specify row label font size.
                 #row_names_side = "left",                    #Uncomment to specify row label side.
                 clustering_distance_columns = "pearson",     #Set clustering method for column dendrogram.
                 clustering_distance_rows = "pearson",        #Set clustering method for row dendrogram.
                 row_dend_reorder = FALSE,                    #Whether to apply row dendrogram reordering.
                 column_dend_reorder = TRUE,                  #Whether to apply column dendrogram reordering.
                 show_row_dend = FALSE,                       #Whether to show row dendrogram clustering.
                 show_column_dend = FALSE,                    #Whether to show column dendrogram clustering.
                 width = unit(3, "in"),                       #Set heatmap width.
                 height = unit(3.5, "in"),                    #Set heatmap height.
                 heatmap_legend_param = lgd_list,             #Reference object for legend parameters.
                 name = "Z-score")                            #Set title.

ht_merge = ht_selected_label + rowAnnotation(mark = foo)
ht_merge = draw(ht_merge, heatmap_legend_side = "bottom", align_heatmap_legend = "heatmap_left") #Note: 'align_heatmap_legend' = 'heatmap_center' position is offset because heatmap has been merged with 'rowAnnotation'.
svg(filename = "2D_Heatmap_Secreted_CompactLabel.svg", width = 4.5, height = 4.5)
dev.off()
```





